This document describes how we map the checklist data to Darwin Core. The source file for this document can be found here.

Load libraries:

library(tidyverse)      # Data manipulation
library(obisdi)         # Tools for data ingestion for OBIS
library(here)           # Get paths (important!)
library(arrow)          # To deal with parquet files

1 Read source data

The checklist will be downloaded from FigShare. We use the obisdi function get_figshare to download the files and to obtain the dataset metadata. Because the files are large, we add a check so the data is downloaded only once, and we save the resulting metadata:

# Get the path to data/raw
raw_path <- here("data", "raw")

# See if files were already downloaded
lf <- list.files(raw_path)
if (!any(grepl("figshare", lf))) {
  fig_details <- get_figshare(article_id = 21997559, download_files = T,
                              save_meta = T, path = raw_path)
}

Following the download, the details of the dataset can be accessed from the file data/raw/figshare_metadata_09062023.csv.

Title: Global cold-water coral diversity dataset
Authors: Eliza Fragkopoulou, Viktoria Balogh, Ester Serrão, Jorge Assis
Date (dmy format): 22/02/2023
DOI: 10.6084/m9.figshare.21997559.v2
URL: https://figshare.com/articles/dataset/A_comprehensive_dataset_of_the_cold-water_coral_diversity/21997559

2 Preprocessing

First we reduce the size of the raw files by converting them to the parquet format. We keep only the flagged file, which is the one we will include in the OBIS database.

raw_files <- list.files(raw_path, full.names = T)
file.remove(raw_files[-grep("Flagged|metadata", raw_files)])

# We only run the conversion on the first knitting of this document
if (any(grepl("xlsx", raw_files))) {
  flagged <- readxl::read_xlsx(paste0(raw_path, "/databaseFlaggedFinal.xlsx"))
  write_parquet(flagged, paste0(raw_path, "/databaseFlagged.parquet"))
  rm(flagged)
  file.remove(paste0(raw_path, "/databaseFlaggedFinal.xlsx"))
}

Now we can load the parquet file containing the dataset we will work with.

dataset <- read_parquet(paste0(raw_path, "/databaseFlagged.parquet"))
head(dataset)

We will filter the dataset to remove records that are already available on OBIS. This dataset contains records that were directly derived from OBIS, but also some from other OBIS-derived sources (like MedOBIS). We therefore filter by both “Ocean Biogeographic Information System” (the old name) and “Ocean Biodiversity Information System”.

dataset_filt <- dataset %>%
  mutate(proc_bibliographicCitation = tolower(bibliographicCitation)) %>%
  filter(!grepl("ocean biogeographic information system|ocean biodiversity information system", proc_bibliographicCitation)) %>%
  select(-proc_bibliographicCitation)

This dataset was already standardized to Darwin Core (DwC) and is almost ready to use. However, the MeasurementOrFact column needs to be translated to the right format. We start by creating a new object with the occurrenceID and the MeasurementOrFact columns, and then separate each measurement into its own column.

flags <- dataset_filt %>%
  select(occurrenceID, MeasurementOrFact)

flags <- flags %>%
  separate_wider_delim(cols = 2, delim = ",",
                       names = c("flagLand", "flagVerticalRange", "flagGeographicRange")) 

flags_conv <- flags %>%
  mutate(flagLand = str_remove(str_remove(flagLand, '\\{\\".*\\"\\:\\['), "\\]"),
         flagVerticalRange = str_remove(str_remove(flagVerticalRange, '\\".*\\"\\:\\['), "\\]"),
         flagGeographicRange = str_remove(str_remove(flagGeographicRange, '\\".*\\"\\:\\['), "\\].")) %>%
  mutate(across(2:4, function(x){
    x <- gsub('\\"', "", x)
    as.numeric(x)
  }))
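
To make the regex cleanup above concrete, here is a minimal base-R sketch applied to a single made-up MeasurementOrFact string (the JSON-like format is an assumption inferred from the patterns the regexes target):

```r
# Hypothetical MoF value, as suggested by the regexes above
mof <- '{"flagLand":[0],"flagVerticalRange":[1],"flagGeographicRange":[0]}'

# Split into the three flag fields, as separate_wider_delim() does
parts <- strsplit(mof, ",")[[1]]

# Strip the key, quotes and brackets, keeping only the numeric value
clean <- function(x) {
  x <- gsub('\\{?\\".*\\"\\:\\[', "", x)  # drop the key and opening bracket
  x <- gsub("\\]\\}?", "", x)             # drop the closing bracket (and brace)
  as.numeric(gsub('\\"', "", x))
}
sapply(parts, clean, USE.NAMES = FALSE)
# 0 1 0
```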

3 Darwin Core mapping

As we said, this dataset already follows the DwC standard, but two adjustments have to be made. First, we build a scientificNameID column from the accepted AphiaID and copy the authority column into scientificNameAuthorship. We also create a column called scientificName using the acceptedname information.

dataset_filt <- dataset_filt %>%
  mutate(scientificNameID = paste0("urn:lsid:marinespecies.org:taxname:", acceptedAphiaID),
         scientificNameAuthorship = authority,
         scientificName = acceptedname)
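
For reference, the LSID construction produces identifiers like the one below (the AphiaID value is a made-up example):

```r
# Toy AphiaID, purely illustrative
aphia_id <- 12345
paste0("urn:lsid:marinespecies.org:taxname:", aphia_id)
# "urn:lsid:marinespecies.org:taxname:12345"
```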

We also need to modify the flags object we created, because it is not in the STAR format. We adjust that with the following code:

flags_conv <- flags_conv %>%
  pivot_longer(cols = 2:4,
               names_to = "measurementType",
               values_to = "measurementValue")
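
As a toy illustration of this wide-to-long reshape, using tidyr and made-up values:

```r
library(tidyr)

toy <- data.frame(
  occurrenceID = c("occ1", "occ2"),
  flagLand = c(0, 1),
  flagVerticalRange = c(0, 0)
)

# Each flag column becomes a measurementType/measurementValue pair
toy_long <- pivot_longer(toy, cols = 2:3,
                         names_to = "measurementType",
                         values_to = "measurementValue")
nrow(toy_long)
# 4 (one row per occurrenceID x flag combination)
```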

We can check that the conversion worked by comparing the frequency of each flag's values before and after the conversion:

cbind(data.frame(table(flags$flagLand)),
      Freq_conv = data.frame(table(
        flags_conv$measurementValue[flags_conv$measurementType == "flagLand"]
      ))[, 2])
cbind(data.frame(table(flags$flagVerticalRange)),
      Freq_conv = data.frame(table(
        flags_conv$measurementValue[flags_conv$measurementType == "flagVerticalRange"]
      ))[, 2])
cbind(data.frame(table(flags$flagGeographicRange)),
      Freq_conv = data.frame(table(
        flags_conv$measurementValue[flags_conv$measurementType == "flagGeographicRange"]
      ))[, 2])

That’s all we needed to do with the data.

4 Post-processing

As a final step, we remove the MeasurementOrFact column from the dataset_filt object, as this information will be supplied to the IPT in a separate file. We also remove some columns that cannot be mapped to DwC and are not necessary, and we set the coordinateUncertaintyInMeters column to NA, because here it was used only as an indicator of whether this information exists (i.e. TRUE/FALSE) rather than as an actual uncertainty value.

dataset_filt <- dataset_filt %>%
  select(-MeasurementOrFact, -name, -originalSourceType, -sourceBibliographicCitation,
         -depthAccuracy, -acceptedname, -aphiaID, -acceptedAphiaID, -authority, -status) %>%
  relocate(occurrenceID) %>%
  mutate(coordinateUncertaintyInMeters = NA)

And those are the final objects:

dataset_filt
flags_conv

5 Export final files

We can then save the final files:

processed_path <- here("data", "processed")

write_csv(flags_conv, paste0(processed_path, "/extension.csv"))

# Because the occurrence table is large and GitHub has size limitations,
# we split the file into multiple files
div <- ceiling(nrow(dataset_filt)/10)
splits <- seq(1, nrow(dataset_filt), by = div)
splits[length(splits)] <- nrow(dataset_filt)
splits <- splits[-1]
st <- 1
for (i in splits) {
  write_csv(dataset_filt[st:i,],
            paste0(processed_path, "/occurrence_part", which(splits == i),".csv"),
            na = "") # We replace NA with empty because the IPT does not interpret NA
  st <- i+1
}
# We can do a quick check that everything went ok:
lf <- list.files(processed_path, "occurrence_part", full.names = T)
parts <- do.call("rbind", lapply(lf, read.csv))

nrow(dataset_filt) == nrow(parts)
## [1] TRUE
all.equal(dataset_filt$occurrenceID,
          parts$occurrenceID)
## [1] TRUE
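
The splitting logic itself can be sanity-checked in isolation; this base-R sketch replays it on a toy row count and confirms every row lands in exactly one part:

```r
# Replay the splitting logic above with a toy number of rows
n <- 25
div <- ceiling(n / 10)
splits <- seq(1, n, by = div)
splits[length(splits)] <- n
splits <- splits[-1]

st <- 1
rows <- integer(0)
for (i in splits) {
  rows <- c(rows, st:i)  # rows that would be written to this part
  st <- i + 1
}
identical(rows, 1:n)
# TRUE: all rows covered exactly once, in order
```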

Finally, we check that the files were saved:

list.files(processed_path)
##  [1] "extension.csv"        "occurrence_part1.csv" "occurrence_part2.csv"
##  [4] "occurrence_part3.csv" "occurrence_part4.csv" "occurrence_part5.csv"
##  [7] "occurrence_part6.csv" "occurrence_part7.csv" "occurrence_part8.csv"
## [10] "occurrence_part9.csv"


Dataset edited by the OBIS secretariat.

The harvesting of this data to OBIS is part of the MPA Europe project.

MPA Europe project has been approved under HORIZON-CL6-2021-BIODIV-01-12 — Improved science based maritime spatial planning and identification of marine protected areas.

Co-funded by the European Union. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or UK Research and Innovation. Neither the European Union nor the granting authority can be held responsible for them.


OBIS Data Ingestion | Ocean Biodiversity Information System (obis.org). For more information on how to contribute data to OBIS, see the OBIS manual. Created with the obisdi package.